Toxicity Policies

Overview

Toxicity policies detect toxic content in user inputs and model responses. DynamoGuard’s default toxicity policy uses the following taxonomy (a sketch of how such a taxonomy might be encoded follows the lists below):

  • The following behaviors and topics are considered to be ‘toxic’:
    • Violent or hurtful topics
    • Hateful, derogatory, or discriminatory speech
    • Sexual references, including but not limited to innuendos, acts, devices, and favors
    • Illegal or criminal activities
    • Encouraging or providing methods of self-harm or suicide
    • Harassment or abusive behavior
    • Sexist or racist discourse
    • Implicit or explicit discrimination or stereotyping
    • Dangerous, unethical, or harmful topics
    • Creating, obtaining, or operating weapons
    • Procurement or use of harmful substances
    • Toxic, abusive, or offensive behaviors
    • Biased, unfair, or ignorant remarks
    • Untruthful or misleading statements
    • Malice or manipulation
    • Vulgar or offensive language
  • The following behaviors and topics are not considered to be ‘toxic’:
    • Asking general, harmless queries
    • Providing responsible information on violence and discrimination
    • Responsible sexual education, health, or consent
    • Factual resources for mental health
    • Queries on resources for managing conflicts and reporting harassment
    • Promoting diversity, fairness, and inclusion
    • Crime prevention
    • Responsible weapon ownership
    • Providing responsible, harmless, and safe information on substances
    • Explaining ethical and responsible behavior
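
For intuition, the two lists above amount to a pair of behavior sets that a policy engine matches text against. The snippet below is a hypothetical encoding only; the field names and structure are illustrative assumptions, not DynamoGuard’s actual policy format.

```python
# Hypothetical encoding of the default toxicity taxonomy as a policy object.
# The structure and field names are illustrative, not DynamoGuard's format.
toxicity_policy = {
    "name": "default-toxicity",
    "toxic_behaviors": [
        "violent or hurtful topics",
        "hateful, derogatory, or discriminatory speech",
        "sexual references, including innuendos, acts, devices, and favors",
        "illegal or criminal activities",
        # ... remaining 'toxic' behaviors from the list above
    ],
    "non_toxic_behaviors": [
        "general harmless queries",
        "responsible information on violence and discrimination",
        "responsible sexual education, health, or consent",
        # ... remaining 'not toxic' behaviors from the list above
    ],
}
```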

Toxicity Policy Actions

Toxicity policies currently support flagging and blocking content; a sketch of how these actions might be applied follows the list below.

  • Flag: allow user inputs and model outputs containing toxic content, but flag the input or output in the moderator view
  • Block: block user inputs or model outputs containing toxic content
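
To make the two actions concrete, here is a minimal sketch of how a gateway might apply them. None of these names are real DynamoGuard APIs; check_toxicity() stands in for the actual policy check.

```python
# Hypothetical sketch of flag vs. block handling at an inference gateway.
# None of these names are real DynamoGuard APIs; check_toxicity() stands in
# for the actual policy check.

BLOCK_MESSAGE = "This request was blocked by the toxicity policy."
moderator_queue: list[str] = []  # content surfaced in the moderator view


def check_toxicity(text: str) -> bool:
    """Placeholder classifier; a real policy would call a moderation model."""
    toxic_terms = {"slur", "threat"}  # stand-in heuristic, not the real taxonomy
    return any(term in text.lower() for term in toxic_terms)


def apply_policy(text: str, action: str = "block") -> str:
    """Apply a toxicity policy action to a user input or model output.

    action="flag"  -> allow the text through, but record it for moderators
    action="block" -> replace the text so it never reaches its destination
    """
    if not check_toxicity(text):
        return text
    if action == "flag":
        moderator_queue.append(text)  # flagged content appears in moderator view
        return text                   # ...but is still allowed through
    return BLOCK_MESSAGE              # blocked content is replaced entirely
```

The practical difference between the two actions is visible in the return paths: flagged content still reaches the user or model while being queued for human review, whereas blocked content is replaced before delivery.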